A Study on the Financial health of the residents og Engagement, Ohio, USA.
In this Take Home exercise, we strive to answer the following questions about the financial health of the residents of Engagement, Ohio, USA.
How does the financial health of the residents change over the period covered by the dataset?
How do wages compare to the overall cost of living in Engagement?
Are there groups that appear to exhibit similar patterns?
The dataset used is available for download here. Out of the data available, we will be using the Financial Journal to understand the income and expense of the participants over a period of time.
We will first load the required packages using the below code chunk
packages = c('tidyverse','ggdist','gghalves','ggthemes','hrbrthemes','ggridges','patchwork','zoo', 'ggrepel','ggiraph','lubridate','gganimate','scales')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The code chunk below imports FinancialJournal.csv and
Participants.csv from the data folder by using read_csv()
function of readr
package into R and save it as a tibble dataframe called
financial_data.
financial_data <- read_csv("data/FinancialJournal.csv")
participants_data <- read_csv("data/Participants.csv")
In the below code chunk, financial data of participants are grouped by Month and Aggregated to find their monthly Income and Expenses. We will also calcualte the savings of the participants. The data is then joined with Participants data so that we have information about participants education level and other details.
level_order <- c('Graduate','Bachelors','HighSchoolOrCollege','Low')
participant_fin <- financial_data %>%
mutate(date = as.yearmon(timestamp)) %>%
group_by(participantId, date) %>%
summarise(income = sum(ifelse(amount > 0,amount,0)), expense = sum(ifelse(amount <= 0,amount,0))) %>%
mutate(savings = round(income + expense,digits = 0)) %>%
inner_join(participants_data, by =c('participantId'))%>%
mutate(educationLevel = factor(educationLevel, levels = level_order))
We will further wrangle the data to find the average income of participants with different Education Levels.
From the below plot, we can see that the participants with Low education have lower Income where as participants with Graduate and Bachelor level education has higher Income. It is observed that there is a major dip in Income in April, 2022 and Feb, 2023 where as the income is high in March,2022, Aug,2022 and March 2023.
participant_fin %>%
ggplot(aes(x=date, y = income, group =participantId, color =educationLevel))+
geom_line_interactive(size = 0.6)+
ylab("Income")+
xlab("Month, Year")+
theme(axis.title.y=element_text(angle =0, margin = margin(t=-50,r=-20)),
axis.title.x=element_text(margin = margin(t=-10)),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_line(colour = "grey90"),
panel.grid.major.x = element_line(colour = "grey90"),
panel.grid.major.y = element_line(colour = "grey90"),
panel.background = element_rect(fill = "white"),
axis.text.x = element_text(size =16, angle = 45, margin = margin(t = 30)),
axis.text.y = element_text(size =16),
axis.line = element_line(color="grey25", size = 0.02),
axis.title = element_text(size=16),
legend.title = element_text(size =16),
legend.text = element_text(size = 16),
plot.title = element_text(size =20,hjust = 0.5))+
ggtitle("Income of Participants by Education Level")
Since there are too many lines, let us add a tooltip to display the income and education level on hover
participant_fin$tooltip <- c(paste0(
"Id = ", participant_fin$participantId,
"\n Income = $", round(participant_fin$income,digits = 0),
"\n Education :",participant_fin$educationLevel))
p1 <- participant_fin %>%
ggplot(aes(x=date, y = income, group =participantId, color =educationLevel))+
geom_line_interactive(aes(tooltip = tooltip),size =0.4)+
ylab("Income")+
xlab("Month, Year")+
theme(panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_line(colour = "grey90"),
panel.grid.major.x = element_line(colour = "grey90"),
panel.grid.major.y = element_line(colour = "grey90"),
panel.background = element_rect(fill = "white"),
axis.text.x = element_text(size =16, angle = 45, margin = margin(t = 30,r=30)),
axis.text.y = element_text(size =16),
axis.line = element_line(color="grey25", size = 0.2),
axis.title = element_text(size=16),
legend.title = element_text(size =16),
legend.text = element_text(size = 16),
plot.title = element_text(size =20,hjust = 0.5))+
ggtitle("Income of Participants by Education Level")
girafe(
ggobj = p1,
width_svg = 12,
height_svg = 12*0.618
)
Let us clean up this graph to see the average Income by Education Level
Education_fin %>%
ggplot(aes(x=date, y = income,col =educationLevel))+
geom_line(size = 0.75)+
ylab("Income")+
xlab("Month, Year")+
theme(axis.title.y=element_text(angle =0),
axis.title.x=element_text(margin = margin(t=-10)),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_line(colour = "grey90"),
panel.grid.major.x = element_line(colour = "grey90"),
panel.grid.major.y = element_line(colour = "grey90"),
panel.background = element_rect(fill = "white"),
axis.text.x = element_text(size =16, angle = 45, margin = margin(t = 30)),
axis.text.y = element_text(size =16),
axis.line = element_line(color="grey25", size = 0.02),
axis.title = element_text(size=16),
legend.title = element_text(size =16),
legend.text = element_text(size = 16),
plot.title = element_text(size =20,hjust = 0.5))+
ggtitle("Average Income by Education Level")
Let us now try to visualize both income and expense of different Education Levels on the same graph
Education_fin %>%
ggplot(aes(x=date))+
geom_line(aes(y= income,col =educationLevel),size = 0.75)+
geom_line(aes(y = abs(expense), col =educationLevel ),size = 0.75,linetype = 2)+
scale_color_manual('Education Level', values = c('blue','green','red','black'))+
scale_linetype_manual(name='Income/Expense',values = c(1,2), labels = c('income','expense'))+
scale_y_continuous(name = "Income",
sec.axis = sec_axis(trans = ~.,name="Expense"))+
xlab("Month, Year")+
theme(axis.title.y=element_text(angle =0),
axis.title.x=element_text(margin = margin(t=-10)),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_line(colour = "grey90"),
panel.grid.major.x = element_line(colour = "grey90"),
panel.grid.major.y = element_line(colour = "grey90"),
panel.background = element_rect(fill = "white"),
axis.text.x = element_text(size =16, angle = 45, margin = margin(t = 30)),
axis.text.y = element_text(size =16),
axis.line = element_line(color="grey25", size = 0.02),
axis.title = element_text(size=16),
legend.title = element_text(size =16),
legend.text = element_text(size = 16),
plot.title = element_text(size =20,hjust = 0.5))+
ggtitle("Average Income and Expense by Education Level")
One challenge faced for this graph is to add a second legend while using dual y axis. Also, the x axis tick labels does not show all the levels.
From the above plot, we can see that the expenses(dotted lines) are more or less same for different education levels and there is not much variation. This means that the savings will be proportional to the income and we will analyse this using scatterplot which is a better choice for continuous data.
For this we will add a column for savings to the dataset which is the difference between the income and expense.
participant_fin$tooltip <- c(paste0(
"Id = ", participant_fin$participantId,
"\n Savings = $",participant_fin$savings))
p2 <- participant_fin %>%
filter(date == 'Apr 2022') %>%
ggplot(aes(x=income, y = abs(expense), size = savings, color = educationLevel))+
geom_point_interactive(aes(tooltip = tooltip), alpha=0.7) +
ggtitle("Income vs Expense by different Education Levels") +
ylab("Expense") +
xlab("Income")+
theme_minimal() +
theme(axis.line = element_line(size = 0.5),
axis.text = element_text(size = 16),
axis.title = element_text(size=16),
axis.title.y = element_text(angle = 0),
legend.title = element_text(size =16),
legend.text = element_text(size = 16),
plot.title = element_text(size =20,hjust = 0.5))
girafe(
ggobj = p2,
width_svg = 16,
height_svg = 16*0.618
)
participant_fin %>%
filter(date >= 'Apr 2022') %>%
transform(date = as.Date(date, frac = 1)) %>%
ggplot(aes(x=income, y = abs(expense), size = savings, color = educationLevel))+
geom_point(alpha=0.7) +
ggtitle("Income vs Expense by different Education Levels") +
ylab("Expense") +
xlab("Income")+
theme_minimal() +
theme(axis.line = element_line(size = 0.5),
axis.text = element_text(size = 16),
axis.title = element_text(size=16),
axis.title.y = element_text(angle = 0),
legend.title = element_text(size =16),
legend.text = element_text(size = 16),
plot.title = element_text(size =20,hjust = 0.5))+
labs(title ='Period : {frame_time}')+
transition_time(date)+
ease_aes('linear')